Collecting and Analysing Chats and Tweets in SoNaR

Author

  • Eric Sanders
Abstract

This paper describes a collection of chats and tweets from the Netherlands and Flanders. The chats and tweets are part of the freely available SoNaR corpus, a 500-million-word text corpus of the Dutch language. Recruitment, metadata, anonymisation and IPR issues are discussed. To illustrate differences in language use across text types and other parameters (such as gender and age), a simple text analysis in the form of unigram frequency lists is carried out. Furthermore, a website is presented with which users can retrieve their own frequency lists.
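
As an illustration of what a unigram frequency list per text type could look like, here is a minimal Python sketch. It is not the paper's implementation: the function name `unigram_frequencies`, the regex tokenisation, and the toy Dutch messages are all assumptions made for this example, and none of the data comes from SoNaR.

```python
from collections import Counter
import re


def unigram_frequencies(texts):
    """Count lowercased word tokens across an iterable of strings.

    Tokenisation is a simple regex split (an assumption for this sketch);
    a real corpus pipeline would likely use a dedicated tokeniser.
    """
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"\w+", text.lower()))
    return counts


# Hypothetical toy messages grouped by text type (not taken from SoNaR).
samples = {
    "chat": ["hoi hoe gaat het", "het gaat goed"],
    "tweet": ["goed nieuws vandaag", "het is vandaag mooi weer"],
}

# One frequency list per text type, so the lists can be compared.
frequency_lists = {
    kind: unigram_frequencies(messages) for kind, messages in samples.items()
}

for kind, counts in frequency_lists.items():
    print(kind, counts.most_common(3))
```

The same grouping could be keyed on any other metadata field, such as gender or age band, in place of the text type.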


Similar resources

Chats, Tweets and SMS in the SoNaR Corpus: Social Media Collection

In this paper we discuss the compilation of a social media corpus with chats, tweets and SMS text messages as part of the SoNaR corpus, a 500-million word reference corpus of written Dutch, comprising many different text categories. Social media are more and more becoming part of everyday life, which makes the need for social media corpora an urgent matter for research. Special focus was addres...


Toward a Comparable Corpus of Latvian, Russian and English Tweets

Twitter has become a rich source for linguistic data. Here, a possibility of building a trilingual Latvian-Russian-English corpus of tweets from Riga, Latvia is investigated. Such a corpus, once constructed, might be of great use for multiple purposes including training machine translation models, examining cross-lingual phenomena and studying the population of Riga. This pilot study shows that...


Effects of Receiving Corrective Feedback through Online Chats and Class Discussions on Iranian EFL Learners' Writing Quality

Giving corrective feedback (CF) is an essential part of the teaching and learning process, and the way it should beneficially be done has been the focus of attention for numerous researchers especially when traditional ways of CF provision are not possible, particularly in rare situations such as outbreaks of diseases. This study investigated how different ways of giving feedback; namely, throu...


(Dis)agreements in Iranians’ Internet Relay Chats

The present study on politeness is an attempt to examine (dis)agreeing strategies utilized by EFL learners while chatting on the internet. Subjects of the study were forty male and thirty-three female Iranian natives whose internet relay chat (IRC) interactions, composed of 400 excerpts, were collected between December 2007 and September 2008. Data analysis was based on the general taxonomy of ...


Detection of Twitter Users' Attitudes about Flu Vaccine based on the Content and Sentiment Analysis of the Sent Tweets

Introduction: The influenza vaccine is one of the controversial challenges in today's societies. Considering the importance of using the flu vaccine in preventing the spread of influenza virus, the Twitter network, as a rich source of data, provides suitable conditions for research in this field to examine the attitudes of different people about this vaccine. The results on the one hand will help h...




Publication date: 2012